feat(computer-use): TuriX-CUA inspired Interactive-View workflow + accuracy hardening #492

Merged
bobleer merged 1 commit into GCWing:main from bobleer:cu/turix-cua-optimizations
Apr 23, 2026

@bobleer bobleer commented Apr 23, 2026

Summary

Inspired by the TuriX-CUA open-source project, this PR overhauls BitFun's desktop Computer Use stack on macOS so the agent can reliably see → plan → act → verify on real GUIs (Tauri WebViews, Cocoa apps, mini-apps embedded in BitFun itself).

The previous AX-only / coordinate-clicking flow had two systemic failure modes:

  1. The model couldn't tell which on-screen widget to act on (no labelled overlay), so it kept clicking the wrong region.
  2. Even when it picked the right widget, the click often didn't land — AX press not supported, view changed between observation and action, no fallback path.

What's in this PR

Interactive-View pipeline (S1-S4)

  • New types in bitfun-core: AxNode, InteractiveElement, and InteractiveView (with stable i index, role/label, image-pixel + screen geometry, optimistic-locking digest).
  • interactive_filter.rs — turns the raw macOS AX dump into a clean, indexed, deduplicated list of actionable elements.
  • som_overlay.rs — Set-of-Mark JPEG overlay that renders numbered bounding boxes on top of a focused-window screenshot (drop-in for the agent's vision).
  • desktop_host.rs — wires the above into build_interactive_view, on top of the new macos_ax_dump.
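
The filtering step above can be illustrated with a minimal sketch. The struct fields, the `ACTIONABLE_ROLES` list, and the `filter_interactive` function are hypothetical simplifications for illustration; the PR's actual types carry more data (image-pixel geometry, digest) and its filter rules are not shown in this description.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a raw AX node.
#[derive(Debug, Clone)]
pub struct AxNode {
    pub role: String,
    pub label: String,
    /// (x, y, width, height) in screen points.
    pub frame: (i32, i32, i32, i32),
}

/// Hypothetical, simplified stand-in for an indexed, actionable element.
#[derive(Debug, Clone, PartialEq)]
pub struct InteractiveElement {
    /// Index the agent uses to address the element.
    pub i: usize,
    pub role: String,
    pub label: String,
    pub frame: (i32, i32, i32, i32),
}

/// Roles treated as actionable in this sketch (an assumed list).
const ACTIONABLE_ROLES: [&str; 4] = ["AXButton", "AXLink", "AXTextField", "AXCell"];

/// Keep actionable, non-empty nodes, drop exact duplicates, and number the rest.
pub fn filter_interactive(nodes: &[AxNode]) -> Vec<InteractiveElement> {
    let mut seen: HashSet<(String, (i32, i32, i32, i32))> = HashSet::new();
    let mut out = Vec::new();
    for node in nodes {
        let (_, _, w, h) = node.frame;
        // Skip non-actionable roles and zero-sized frames.
        if !ACTIONABLE_ROLES.contains(&node.role.as_str()) || w <= 0 || h <= 0 {
            continue;
        }
        // Deduplicate on (role, frame); `insert` returns false for repeats.
        if !seen.insert((node.role.clone(), node.frame)) {
            continue;
        }
        out.push(InteractiveElement {
            i: out.len(),
            role: node.role.clone(),
            label: node.label.clone(),
            frame: node.frame,
        });
    }
    out
}
```

In this sketch the index `i` is simply the position in the filtered list, so it is only meaningful together with the view it came from.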

ControlHub desktop actions (S5)

Four new actions exposed via ControlHub, all addressed by a single i index plus a before_view_digest:

  • interactive_click (with click count, modifiers, mouse button)
  • interactive_type_text (focus an element by i, then type)
  • interactive_scroll
  • interactive_key_chord
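
As a rough illustration of the addressing scheme above, the sketch below models a request that carries `i` plus `before_view_digest` and is rejected when the digest no longer matches. The `InteractiveRequest`, `InteractiveAction`, `ValidateError`, and `validate` names and their fields are assumptions made for this example, not the ControlHub schema.

```rust
/// Hypothetical action payloads, one per new action.
#[derive(Debug, Clone, PartialEq)]
pub enum InteractiveAction {
    Click { click_count: u8 },
    TypeText { text: String },
    Scroll { dx: i32, dy: i32 },
    KeyChord { keys: Vec<String> },
}

/// Hypothetical request shape: every action names an element by `i`
/// and states which view the caller was looking at.
#[derive(Debug, Clone, PartialEq)]
pub struct InteractiveRequest {
    pub i: usize,
    pub before_view_digest: String,
    pub action: InteractiveAction,
}

#[derive(Debug, PartialEq)]
pub enum ValidateError {
    /// The caller's digest does not match the current view.
    StaleInteractiveView,
    /// `i` does not name an element in the current view.
    IndexOutOfRange,
}

/// Optimistic-locking check: act only if the view is unchanged.
pub fn validate(
    req: &InteractiveRequest,
    current_digest: &str,
    element_count: usize,
) -> Result<(), ValidateError> {
    if req.before_view_digest != current_digest {
        return Err(ValidateError::StaleInteractiveView);
    }
    if req.i >= element_count {
        return Err(ValidateError::IndexOutOfRange);
    }
    Ok(())
}
```

Checking the digest before the index means a stale caller is told to refresh rather than being handed a misleading range error.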

Click reliability hardening

  • Digest stability: compute_interactive_view_digest now hashes only role + bucketed geometry; ignores transient focus/value/label jitter, so harmless re-renders no longer invalidate the view.
  • Auto-rebuild on STALE: interactive_click catches STALE_INTERACTIVE_VIEW, rebuilds the view once, retries with the new digest, and tags the result with auto_rebuilt_view_after_stale so the agent knows to re-verify.
  • Multi-channel fallback: AX press → image-pixel pointer click. If the AX path fails (widget not in AX tree, AXPress not supported, etc.), we fall back to app_click { target: ImageXy } at the element's image-pixel center; result is tagged fallback_image_xy.
  • Relaxed click_element validation: text_contains and node_idx are now valid lone locators (previously you had to also pass title_contains/role_substring/identifier_contains).
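
The digest-stability idea can be sketched as follows. This is not `compute_interactive_view_digest` itself: the `Element` struct, the `BUCKET` size of 8, and the use of the standard library's `DefaultHasher` are all assumptions chosen to keep the example self-contained.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical element; only `role` and `frame` feed the digest.
pub struct Element {
    pub role: String,
    pub label: String,
    pub focused: bool,
    /// (x, y, width, height) in pixels.
    pub frame: (i32, i32, i32, i32),
}

/// Assumed bucket size: coordinates are compared at 8-pixel granularity.
const BUCKET: i32 = 8;

/// Hash role + bucketed geometry; label and focus are deliberately ignored.
pub fn view_digest(elements: &[Element]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for el in elements {
        el.role.hash(&mut hasher);
        let (x, y, w, h) = el.frame;
        for v in [x, y, w, h] {
            // Floor division, so every value in a bucket hashes the same.
            v.div_euclid(BUCKET).hash(&mut hasher);
        }
    }
    hasher.finish()
}
```

Bucketing absorbs jitter only when the old and new values fall in the same bucket; a one-pixel shift across a bucket boundary still changes the digest. `DefaultHasher` is also not guaranteed stable across Rust releases, so a persisted digest would need a fixed hash function instead.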

UI consolidation

  • Card-merging heuristic in interactive_filter: when an actionable container (AXCell, AXRow, AXButton, AXLink, AXGroup, …) fully contains smaller actionable children and is ≥1.5× their area, the children are absorbed. This dramatically cuts overlay clutter on real apps (table rows, list items, card grids).
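
A minimal sketch of the absorption rule, assuming every rectangle passed in is already known to be actionable and that the 1.5× threshold is applied per child (the description does not say whether the real filter compares against each child or their combined area). `Rect`, `MIN_AREA_RATIO`, and `absorb_children` are names invented for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

impl Rect {
    fn area(&self) -> f64 {
        self.w * self.h
    }

    /// True when `other` lies entirely inside `self` (edges may touch).
    fn contains(&self, other: &Rect) -> bool {
        other.x >= self.x
            && other.y >= self.y
            && other.x + other.w <= self.x + self.w
            && other.y + other.h <= self.y + self.h
    }
}

/// Threshold taken from the PR description: container area >= 1.5x child area.
const MIN_AREA_RATIO: f64 = 1.5;

/// Returns the indices of elements that survive merging. An element is
/// absorbed when some other element fully contains it and is large enough.
pub fn absorb_children(rects: &[Rect]) -> Vec<usize> {
    (0..rects.len())
        .filter(|&child| {
            !(0..rects.len()).any(|parent| {
                parent != child
                    && rects[parent].contains(&rects[child])
                    && rects[parent].area() >= MIN_AREA_RATIO * rects[child].area()
            })
        })
        .collect()
}
```

The area threshold keeps two nearly identical overlapping rectangles from absorbing each other, so only clearly nested children disappear from the overlay.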

Prompt update (claw_mode.md)

  • Replaces the old AX-first guidance with Interactive-View-first workflow.
  • Adds a mandatory OBSERVE → PLAN → EXPECT → VERIFY loop template the agent must follow on every interactive_* turn (single biggest accuracy lever in our internal tests).
  • Documents the new auto_rebuilt_view_after_stale / fallback_image_xy recovery notes and how to react to them.
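
To make the two recovery notes concrete, here is a hedged sketch of a click flow that could produce them: one rebuild-and-retry on a stale view, then a pointer-click fallback when the AX press is unsupported. The `Host` trait, `AxPress` enum, `ClickResult` struct, and `FakeHost` are hypothetical; only the two flag names come from the PR.

```rust
/// Hypothetical outcome of an AX press attempt.
#[derive(Debug, PartialEq)]
pub enum AxPress {
    Ok,
    Unsupported,
    Stale,
}

/// Hypothetical result; the two flags mirror the tags named in the PR.
#[derive(Debug, Default, PartialEq)]
pub struct ClickResult {
    pub clicked: bool,
    pub auto_rebuilt_view_after_stale: bool,
    pub fallback_image_xy: bool,
}

/// Hypothetical host interface, so the flow can be exercised without macOS.
pub trait Host {
    fn ax_press(&mut self, i: usize, digest: &str) -> AxPress;
    /// Rebuilds the view and returns its new digest.
    fn rebuild_view(&mut self) -> String;
    /// Pointer click at the element's image-pixel center.
    fn pointer_click_image_xy(&mut self, i: usize) -> bool;
}

pub fn interactive_click<H: Host>(host: &mut H, i: usize, digest: &str) -> ClickResult {
    let mut result = ClickResult::default();
    let mut outcome = host.ax_press(i, digest);
    if outcome == AxPress::Stale {
        // Rebuild exactly once and retry with the fresh digest.
        let fresh = host.rebuild_view();
        result.auto_rebuilt_view_after_stale = true;
        outcome = host.ax_press(i, &fresh);
    }
    match outcome {
        AxPress::Ok => result.clicked = true,
        AxPress::Unsupported => {
            result.fallback_image_xy = true;
            result.clicked = host.pointer_click_image_xy(i);
        }
        // Still stale after one rebuild: report failure instead of looping.
        AxPress::Stale => {}
    }
    result
}

/// Scripted fake used only to exercise the sketch.
pub struct FakeHost {
    pub current_digest: String,
    pub ax_supported: bool,
}

impl Host for FakeHost {
    fn ax_press(&mut self, _i: usize, digest: &str) -> AxPress {
        if digest != self.current_digest {
            AxPress::Stale
        } else if self.ax_supported {
            AxPress::Ok
        } else {
            AxPress::Unsupported
        }
    }

    fn rebuild_view(&mut self) -> String {
        self.current_digest.clone()
    }

    fn pointer_click_image_xy(&mut self, _i: usize) -> bool {
        true
    }
}
```

Because a rebuilt view may assign index `i` to a different element, a result with `auto_rebuilt_view_after_stale` set is a signal to re-verify the outcome rather than proof that the intended widget was clicked.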

Supporting macOS plumbing

  • macos_ax_dump.rs — non-throwing AX tree snapshot with cached AXUIElementRef lookup by node idx.
  • macos_ax_write.rs — safe AXPress wrapper (@try/@catch) returning structured AxWriteOutcome.
  • macos_bg_input.rs — background CGEvent-based mouse / scroll / keyboard / typing into a target pid (no cursor hijack).
  • macos_list_apps.rs — running app enumeration.
  • Plus recursion_limit = "256" for the new generic-heavy modules.

Test plan

  • cargo check -p bitfun-desktop -p bitfun-core — clean.
  • Unit tests: interactive_filter (incl. new card_container_absorbs_contained_actionable_children), desktop_host digest helpers.
  • Manual dogfooding on macOS: opening the BitFun "五子棋" (Gomoku) mini-app via interactive_click, typing into search fields via interactive_type_text, and scrolling lists via interactive_scroll all succeed without falling back to coordinate guessing.
  • Reviewer dogfooding on a fresh macOS install (please flag any AX-permission UX issues).

Notes

  • This PR is macOS-only. Other platforms keep the existing trait defaults ("not available").
  • No protocol-breaking changes to existing computer_use_* tools — the new interactive_* actions live alongside them.

…curacy hardening

Inspired by the TuriX-CUA open-source project, this overhauls BitFun's
desktop Computer Use stack so the agent can reliably "see → plan →
act → verify" on macOS GUIs.

Highlights
- Interactive-View pipeline (S1-S4): new `AxNode`-derived
  `InteractiveElement` / `InteractiveView` types, AX-tree filtering
  (`interactive_filter.rs`), Set-of-Mark JPEG overlay
  (`som_overlay.rs`), and `desktop_host.rs` wiring on top of the
  macOS AX dump.
- ControlHub desktop actions (S5): four new `interactive_*` /
  `app_*` actions with a single `i` index and `before_view_digest`
  optimistic-locking.
- Click reliability: digest is now geometry/role-stable (ignores
  focus/value jitter); `interactive_click` auto-rebuilds the view on
  `STALE_INTERACTIVE_VIEW` and falls back from AX-press to
  image-pixel pointer click; `click_element` accepts `text_contains`
  / `node_idx` directly.
- Card-merging heuristic in `interactive_filter` collapses redundant
  child widgets inside actionable containers (cells, rows, buttons,
  links, groups), cutting overlay clutter on real apps.
- Prompt update (`claw_mode.md`): mandatory
  OBSERVE → PLAN → EXPECT → VERIFY loop and Interactive-View-first
  guidance.
- Supporting macOS plumbing: `macos_ax_dump`, `macos_ax_write`,
  `macos_bg_input`, `macos_list_apps` (background-input event
  injection, AX press, app enumeration).
- Adds `recursion_limit = "256"` for the new generic-heavy modules.

Tested with `cargo check -p bitfun-desktop -p bitfun-core` and
focused unit tests in `interactive_filter` and `desktop_host`.

Made-with: Cursor
@bobleer bobleer merged commit 068677f into GCWing:main Apr 23, 2026
4 checks passed
@bobleer bobleer deleted the cu/turix-cua-optimizations branch April 24, 2026 05:53
bobleer added a commit to bobleer/BitFun that referenced this pull request Apr 24, 2026
…curacy hardening (GCWing#492)

bobleer added a commit that referenced this pull request Apr 24, 2026
* feat: improve ControlHub browser session handling (#476)

* feat: improve controlhub browser sessions

Tighten ControlHub browser session routing and desktop browser guards, improve relay reconnect handling, and persist FlowChat session title updates alongside model config polish.

Generated with BitFun

Co-Authored-By: BitFun

* fix: resolve SessionModule lint error

Convert requireSessionWorkspacePath to a function declaration so eslint no-use-before-define passes in SessionModule.

Generated with BitFun

Co-Authored-By: BitFun

* fix: handle Windows cert DER bytes correctly

Replace the invalid to_der().ok() call with direct DER byte conversion so bitfun-core compiles on Windows CI.

Generated with BitFun

Co-Authored-By: BitFun

* fix(web-ui): add horizontal padding to CLI auth empty state in AI model settings (#478)

Co-authored-by: bowen628 <bowen628@noreply.gitcode.com>

* Add browser restart confirm flow (#479)

* chore: remove selfcontrol integration (#480)

* feat(computer-use): TuriX-CUA inspired Interactive-View workflow + accuracy hardening (#492)

* feat: align Codex client_version with local CLI; honor proxy in AI config tests (#499)

- Resolve codex CLI version via codex --version for User-Agent and backend model discovery

- Derive client_version query param from User-Agent in OpenAI common adapter

- Use create_transient_ai_client_for_config for test/list-models so proxy and stream options apply

* Fix stale Remote SSH restore entries (#501)

* fix: adapt agentic_os main sync

---------

Co-authored-by: bowen628 <bowen628@noreply.gitcode.com>